From Counts to Context: The NLP Evolution
AI030 Lesson 3
00:00

The evolution of NLP represents a fundamental shift from treating language as discrete, isolated symbols to mapping it into a continuous, multi-dimensional vector space. We have moved from simple feature-based representations to deep semantic maps.

[Figure: TF-IDF sparse vectors (dimensions = vocabulary size) contrasted with Word2Vec distributed vectors clustering "King", "Queen", and "Apple" (dimensions = latent features)]

The Shift in Representation

  • The Statistical Era (Sparse): Early NLP relied on TF-IDF (term frequency-inverse document frequency). While effective for retrieval, it suffers from the "curse of sparsity": in a TF-IDF system, "Physician" and "Doctor" are orthogonal vectors, so mathematically they share zero similarity.
  • The Distributed Revolution (NNLM & Word2Vec): Neural Network Language Models introduced dense vectors. Word2Vec (Skip-gram/CBOW) learns that words appearing in similar contexts should be spatial neighbors.
  • Global Statistics (GloVe): Global Vectors bridge the two approaches by factorizing global co-occurrence counts across the entire corpus, so that distances in the vector space reflect semantic similarity.
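The orthogonality problem in the first bullet can be shown in a few lines. Below is a minimal toy sketch (the corpus, the plain TF-IDF weighting, and the word list are all illustrative assumptions, not from the lesson): because each word owns its own dimension, the vectors for "physician" and "doctor" have zero cosine similarity no matter how synonymous the words are.

```python
import math

# Toy corpus: each "document" is a short list of tokens (illustrative only).
docs = [
    ["the", "physician", "treated", "the", "patient"],
    ["the", "doctor", "treated", "the", "patient"],
    ["the", "apple", "fell", "from", "the", "tree"],
]

# Build the vocabulary; each word owns exactly one dimension (sparse representation).
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

def tf_idf_vector(doc):
    """Plain TF-IDF: term frequency scaled by log inverse document frequency."""
    vec = [0.0] * len(vocab)
    for w in set(doc):
        tf = doc.count(w) / len(doc)
        df = sum(1 for d in docs if w in d)
        vec[index[w]] = tf * math.log(len(docs) / df)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# "physician" and "doctor" occupy different dimensions, so their
# one-hot axis vectors are orthogonal: cosine similarity is exactly 0.
physician = [1.0 if w == "physician" else 0.0 for w in vocab]
doctor = [1.0 if w == "doctor" else 0.0 for w in vocab]
print(cosine(physician, doctor))  # 0.0
```

The two near-identical documents still overlap on "treated" and "patient", so document-level TF-IDF similarity is positive; the failure is at the word level, where synonyms share nothing.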
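To make the Skip-gram bullet concrete, here is a sketch of how its training data is framed (the sentence and window size are arbitrary assumptions; this generates the (center, context) pairs a Word2Vec-style model would train on, not the training itself):

```python
# Each center word is paired with every word inside a fixed context window.
sentence = ["the", "physician", "treated", "the", "patient"]
window = 2  # arbitrary choice for illustration

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# "physician" and "doctor" would generate near-identical pairs in similar
# sentences, which is why their learned vectors end up as spatial neighbors.
print(pairs[:4])
```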
Deep Insight
The transition from counting occurrences to predicting context allows models to capture nuance. This "Distributed Representation" means a single word's meaning is distributed across hundreds of vector dimensions, each potentially representing a latent semantic feature like gender, royalty, or medical context.
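The "latent feature" idea above can be sketched with hand-crafted toy embeddings. The three dimensions and their values below are invented for illustration (real embeddings learn hundreds of uninterpretable dimensions), but they show the geometry behind the classic analogy king - man + woman ≈ queen:

```python
import math

# Hypothetical 3-dimensional dense embeddings. Pretend each axis is a latent
# feature: [royalty, gender, fruitiness]. Values are invented for illustration.
emb = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Vector arithmetic: king - man + woman lands nearest to queen,
# because the "gender" coordinate flips while "royalty" is preserved.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max((w for w in emb if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(target, emb[w]))
print(nearest)  # queen
```

Meaning here is genuinely distributed: no single coordinate "is" queen; the word is identified only by its position across all dimensions at once.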